Abstract



This study shows that a mixture of RNN experts model can acquire the ability to generate
sequences combining multiple primitive patterns by means of self-organizing chaos. Through
training, each expert learns a primitive sequence pattern, and a gating network learns to
imitate stochastic switching among the primitives via chaotic dynamics, exploiting sensitive
dependence on initial conditions. As a demonstration, we present a numerical simulation in which
the model learns Markov chain switching among Lissajous curves by chaotic dynamics.
Our analysis shows that by using a sufficient amount of training data, balanced with the network 
memory capacity, it is possible to satisfy the conditions for embedding the target stochastic 
sequences into a chaotic dynamical system. It is also shown that reconstruction of a stochastic 
time series by a chaotic model can be stabilized by adding a negligible amount of noise to the 
dynamics of the model. 

1 Introduction 

Recurrent neural networks (RNNs) [31 [TTl [T51 [Mj are abstract models of nervous systems that can 
perform temporal sequence processing. RNNs have been applied to temporal sequence learning such 
as sensory- motor sequence patterns [TTl 112] ) symbolic sequences with grammar [31 [TB] and continuous 
spatio-temporal patterns [H]. However, in spite of the considerable amount of RNN research carried 
out since the mid 1980s, it has been thought that RNNs could not be scaled so as to be capable of 
learning complex sequence patterns, especially when the sequence patterns to be learned contain
long-term dependencies [2]. This is due to the fact that the error signal cannot be propagated effectively
in long-time windows of sequences using the backpropagation through time (BPTT) algorithm [20],
because of the potential nonlinearity of the RNN dynamics [^ • How to form neural network models 
that can learn complex temporal sequence patterns has proved to be a challenging problem. 

There have been some breakthroughs in approaches to this problem. Echo state networks [21 ITU] 
and a very similar approach, liquid state machines 15J, have recently attracted considerable attention. 
An echo state network possesses a large pool of hidden neurons with fixed random weights, and the 
pool is capable of rich dynamics. Jaeger [TD] demonstrated that an echo state network can successfully 
learn the Mackey-Glass chaotic time series, which is a well-known benchmark system for time series
prediction. Long Short-Term Memory (LSTM) proposed by Hochreiter and Schmidhuber [T] [H] has 
been investigated as a neural network architecture which is more efficient than standard RNN in 
terms of memory capability. LSTM learns to bridge minimal time lags in long time steps by enforcing 
constant error flow through "constant error carousels" within special units. These units learn to open 
and close access to the constant error flow. 



Tani and Nolfi [23] investigated the same problems, but from a different angle, focusing on the
idea of compositionality for sensory-motor learning. The term compositionality was adopted from 
the "Principle of Compositionality" 'SJ in linguistics, which claims that the meaning of a complex 
expression is determined by the meanings of its constituent expressions and rules used to combine 
them. This principle, when translated to the context of sensory-motor learning, leads to the assertion 
that varied and complex patterns can be learned by adaptively combining reusable behavior primitives. 
Here, acquiring behavior primitives requires a mechanism for autonomously segmenting a continuously 
experienced sensory-motor flow into reusable chunks. Tani and Nolfi proposed a scheme for hierarchical 
segmentation of the sensory-motor flow, applying the idea of a mixture of experts [51 [T31 [5S] to 
hierarchically organized RNNs. Consider a network consisting of multiple local RNNs, organized in 
two levels, referred to as the base level and a higher level. At the base level, RNNs called experts 
competitively learn prediction or generation of specific sensory-motor profiles. As the winner among 
the experts changes, corresponding to structural changes in the sensory-motor flow, the sensory-motor 
flow is segmented by switching between winners. Meanwhile, an RNN at the higher level, usually called 
a gating network, learns the sequence patterns of winner switching with much slower time constants 
than those at the base level. The higher level learns not for details of sensory-motor patterns but for 
abstraction of primitive sequences with long-term dependencies. 

However, how to practically perform learning at the higher level is a difficult problem, because the 
gating network is usually a deterministic system, whereas arbitrary composition of primitives with a 
certain probability distribution brings nondeterminacy to the sequence patterns of winner switching. 
A possibly easy solution to this problem is to apply the output values of the gating network as a 
probability of winner switching, or more directly, a stochastic model such as a hidden Markov model 
(HMM) is used instead of the gating network. Of course, if we use a stochastic model at the higher 
level, nondeterministic sequences can be learned by the model. Nevertheless, using a stochastic model 
might remove from the dynamics the stability brought about by the effect of basins of attraction 
of RNN experts, because stochastic models such as HMMs can operate only with discrete value 
sequences, so a trajectory jumps discontinuously whenever a winner changes. In particular, when 
sequences to be learned are continuous, for example, a sensory-motor flow, the stability might be 
lost by using a stochastic model, because sequence patterns of winner switching are also continuous. 
An alternative solution, which is the main focus of the current paper, is to imitate nondeterminacy 
by using the gating network itself. It has been reported that a nondeterministic process can be
imitated by a chaotic dynamical system from the viewpoint of symbolic dynamics [6, 14]. Moreover, Tani
et al. [17, 22] showed that an RNN can acquire the ability to imitate a nondeterministic symbolic
process by experimental simulations. The approach imitating a nondeterministic process using the 
gating network offers the advantage not only of preserving the dynamical stability, but also that of a 
consistent approach in terms of dynamical systems theory to both the base and higher levels. 

In the current study, we investigate learning for a mixture of experts model focusing on the higher 
level, and show that the model can learn nondeterministic sequence patterns of winner switching 
by a self-organizing internal chaos. In the previous study [16], we demonstrated that a mixture of
experts model with our proposed learning method was able to learn to segment time series consisting 
of many primitive patterns, whereas in earlier studies such a series could not be segmented owing to 
near-miss problems in matching the current sequence pattern to the best expert among others which 
had acquired similar pattern profiles. The previous study, however, focused on the learning process 
of segmenting sequences into reusable primitives with respect to spatio-temporal patterns, and so did 
not deal with higher level learning. This study is therefore focused on higher level learning. 

2 Model 

A mixture of RNN experts is simply a mixture of experts model for which experts are RNNs. The
mixture of RNN experts consists of expert networks together with a gating network (see Figure 1).
All experts receive the same input and have the same number of output neurons. The gating network
receives the past gate opening values and the input and controls gate opening. The role of each expert
is to compute a specific input-output function, and the role of the gating network is to decide which
single expert is the winner on each occasion.

Figure 1: A system comprised of a mixture of RNN experts.

The dynamic states of the mixture of RNN experts at time $n$ are updated according to

where $N$ is the number of experts, and $x_n$ and $y_n$ are an input and output of the model, respectively. For
each $i$, $g_n^{(i)}$, $c_n^{(i)}$ and $y_n^{(i)}$ denote the gate opening value, context and output of the expert network $i$,
respectively. Here we assume $g_n^{(i)} \geq 0$ and $\sum_{i=1}^{N} g_n^{(i)} = 1$. The gate opening vector $g_n$ represents the
winner-take-all competition among experts to determine the output $y_n$. The gate opening vector $g_n$
is the output of a gating network defined by

where $\tau$ represents a feedback time delay. In order to satisfy $g_n^{(i)} \geq 0$ and $\sum_{i=1}^{N} g_n^{(i)} = 1$, equation (8)
is given by the soft-max function. Using the sigmoid function denoted by $\mathrm{sigmoid}(x) = \frac{1}{1+\exp(-x)}$,
equation (8) can be expressed as

This equation indicates that the output of the gating network is given by the sigmoid function with
global suppression.
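The equivalence between the soft-max gate and a "sigmoid with global suppression" can be checked numerically. The following is a minimal sketch, not the authors' implementation: the internal gate states `u` are hypothetical values, and the suppression term is written as the logarithm of the summed exponentials of all other gates, which is one algebraically exact way of rewriting the soft-max.

```python
import math

def softmax(u):
    """Soft-max over a list of internal gate states u."""
    m = max(u)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in u]
    s = sum(e)
    return [x / s for x in e]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_via_sigmoid(u):
    """Same gate values, written as a sigmoid with a suppression term.

    The suppression for gate i is the log of the summed exponentials of all
    other gates, so one strongly active gate pushes every other gate down.
    Requires at least two gates.
    """
    out = []
    for i, ui in enumerate(u):
        others = sum(math.exp(uj) for j, uj in enumerate(u) if j != i)
        out.append(sigmoid(ui - math.log(others)))
    return out

u = [2.0, -1.0, 0.5]          # hypothetical internal states of three gates
g = softmax(u)                # gate opening values, non-negative, sum to one
h = softmax_via_sigmoid(u)    # identical values from the sigmoid form
```

Both functions return the same gate opening vector, which illustrates why the winner-take-all competition can be read as each gate being inhibited by the total activity of the others.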
2.1 Learning Method
2.1.1 Basic procedure 

We present the procedure for training the mixture of RNN experts, organized into two phases as
follows:

(1) We train the experts and, at the same time, compute an estimate of the gate opening vector.

(2) Using the estimate $\hat{g}_n$ of the gate opening vector as a target, we train the gating
network.

Note that in phase (1) we use the method proposed in [16]. This learning is a maximum likelihood
estimation using the gradient descent method. The training procedure progresses to phase (2) after
convergence of phase (1).



2.1.2 Probability distribution 

In order to define the learning method, we assign a probability distribution to the mixture of RNN
experts. Here, the learnable parameters of an expert network $i$ and of the gating network are denoted by
$\vartheta_i$ and $\theta^g$, respectively; each collects the weight matrices, thresholds and initial state of the
corresponding network. An estimate
for the gate opening vector is described by a variable $\beta_n$ such as

Let $X = (x_n)_{n=1}^{T}$ be an input sequence, and let $\beta_n$ and $\gamma$ be parameters, where $\gamma$ is a set of parameters
given by $\gamma_i = (\vartheta_i, \sigma_i)$. Given $X$, $\beta_n$, and $\gamma$, a probability density function (p.d.f.) for the output $y_n$
is defined by

$p(y_n \mid X, \beta_n, \gamma) = \sum_{i=1}^{N} \hat{g}_n^{(i)} \, p(y_n \mid X, \gamma_i)$, (11)

where $p(y_n \mid X, \gamma_i)$ is given by



$d$ is the output dimension, and $y_n^{(i)}$ is the output from expert $i$ computed by equations (1), (2) and
(3) with parameter $\vartheta_i$. Thus, the output of the model is governed by a mixture of normal distributions.
This equation results from the assumption that the observable sequence data is embedded in additive
Gaussian noise. It is well known that minimizing the mean square error is equivalent to maximizing
the likelihood determined by a normal distribution for learning in a single neural network. Therefore,
equation (12) is a natural extension of neural network learning. The details of the derivation of
equation (12) are given in [8].
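To make the mixture-of-normals output distribution concrete, the sketch below evaluates a gate-weighted sum of expert densities. It assumes that each expert's density is an isotropic normal distribution centred on that expert's output with standard deviation $\sigma_i$; this form is an assumption made for illustration, and the function names and numerical values are hypothetical.

```python
import math

def gaussian_density(y, mean, sigma):
    """Isotropic normal density in d = len(y) dimensions (assumed form)."""
    d = len(y)
    sq = sum((a - b) ** 2 for a, b in zip(y, mean))
    norm = (math.sqrt(2.0 * math.pi) * sigma) ** d
    return math.exp(-sq / (2.0 * sigma ** 2)) / norm

def mixture_density(y, expert_outputs, gates, sigmas):
    """Gate-weighted sum of the experts' densities for one time step."""
    return sum(g * gaussian_density(y, m, s)
               for g, m, s in zip(gates, expert_outputs, sigmas))

y = [0.1, -0.2]                              # observed output (hypothetical)
expert_outputs = [[0.1, -0.2], [0.9, 0.8]]   # outputs of two experts
gates = [0.9, 0.1]                           # estimated gate opening values
sigmas = [0.5, 0.5]                          # per-expert noise levels
p = mixture_density(y, expert_outputs, gates, sigmas)
```

With a single expert and a fixed $\sigma$, maximizing the logarithm of this density reduces to minimizing the squared error, which is the sense in which the mixture extends ordinary neural network learning.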

Given a parameter set $\theta = ((\beta_n)_{n=1}^{T}, \gamma)$ and an input sequence $X$, the probability of an output
sequence $Y = (y_n)_{n=1}^{T}$ is given by

$p(Y \mid X, \theta) = \prod_{n=1}^{T} p(y_n \mid X, \beta_n, \gamma)$. (13)

The likelihood function $L$ of a data set $D = (X, Y)$, parameterized by $\theta$, is denoted by

$L(D, \theta) = p(Y \mid X, \theta)\,\varphi(\theta)$, (14)

where $\varphi(\theta)$ is the p.d.f. of a prior distribution given by

,1=1 i=l V ^"S 

This equation means that the vector $\beta_n$ is governed by $N$-dimensional Brownian motion. The prior
distribution has the effect of suppressing the change of gate opening values.

Assume that $D^g = (X, (\hat{g}_n)_{n=1}^{T})$ is given. We define the likelihood $L(D^g, \theta^g)$ by the Dirichlet
distribution

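$p(\hat{g}_n \mid g_n) = \frac{1}{Z(\hat{g}_n)} \prod_{i=1}^{N} \left(g_n^{(i)}\right)^{\hat{g}_n^{(i)}}$, (16)

$L(D^g, \theta^g) = \prod_{n=1}^{T} p(\hat{g}_n \mid g_n)$, (17)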

where $Z(\hat{g}_n)$ is the normalization constant, and $g_n$ is given by $X$ and $\theta^g$. Notice that $g_n$ in equation
(17) is calculated by the teacher forcing technique [24], namely, using $\hat{g}_{n-\tau}$ instead of $g_{n-\tau}$ in equation
(5). In addition, because of the delayed term with index $n-\tau$ on the right-hand side of equation (5), $g_n$ cannot be
computed if $n < \tau$. It is therefore assumed that $g_n = \hat{g}_n$ if $n < \tau$. If we consider $g_n^{(i)}$ to be the
probability of choosing an expert $i$ at time $n$, maximizing $\ln L(D^g, \theta^g)$ is equivalent to minimizing the
Kullback-Leibler divergence

$D_{KL}\left((\hat{g}_n)_{n=1}^{T} \,\|\, (g_n)_{n=1}^{T}\right) = \sum_{n=1}^{T} \sum_{i=1}^{N} \hat{g}_n^{(i)} \ln\left(\frac{\hat{g}_n^{(i)}}{g_n^{(i)}}\right)$. (18)
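As an illustration of the divergence used as the gating network's training objective, the sketch below sums, over time steps and experts, the target gate value times the log ratio of target to output. It assumes that the divergence is taken from the estimated (target) gate sequence to the gating network output; the small floor `eps` that guards against a zero output is an addition made here for numerical safety and is not part of the definition.

```python
import math

def kl_gate_sequences(g_hat, g, eps=1e-12):
    """Summed Kullback-Leibler divergence between two gate sequences.

    g_hat: list of target gate vectors (each sums to one)
    g:     list of gating network output vectors (each sums to one)
    Terms with a zero target contribute nothing (0 * ln 0 is taken as 0).
    """
    total = 0.0
    for target, output in zip(g_hat, g):
        for t, o in zip(target, output):
            if t > 0.0:
                total += t * math.log(t / max(o, eps))
    return total

g_hat = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical winner-take-all targets
g_out = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical gating network outputs
divergence = kl_gate_sequences(g_hat, g_out)
```

For one-hot targets the divergence reduces to the negative log of the output assigned to each winner, so it is zero only when the gating network opens exactly the target gate at every step.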



2.1.3 Maximum likelihood estimation 

The learning method is to choose the best parameters $\theta$ and $\theta^g$ by maximizing (or integrating over)
the likelihoods $L(D, \theta)$ and $L(D^g, \theta^g)$. More precisely, we use the gradient descent method with a
momentum term as the training procedure. The model parameters at learning step $t$ are updated
according to

$\Delta\theta^{*}(t) = \alpha \frac{\partial \ln L(D^{*}, \theta^{*})}{\partial \theta^{*}} + \eta\, \Delta\theta^{*}(t-1)$, where $(D^{*}, \theta^{*})$ is $(D, \theta)$ or $(D^g, \theta^g)$, (19)

$\theta^{*}(t+1) = \theta^{*}(t) + \Delta\theta^{*}(t)$, (20)

where $\alpha$ is the learning rate and $\eta$ is a momentum term parameter. For each parameter, the partial
derivatives $\frac{\partial \ln L(D^{*}, \theta^{*})}{\partial \theta^{*}}$ are given by

Note that the remaining derivatives of the network outputs with respect to the parameters can be solved for
by the backpropagation through time (BPTT) method [20]. In this paper, we assume that the infimum $\sigma$ of
the parameter $\sigma_i$ is greater than 0,
because if $\sigma_i$ converges to 0, then $\Delta\theta(t)$ diverges.
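The momentum update can be illustrated on a toy problem. The sketch below applies the two update rules (a step proportional to the gradient of the log-likelihood plus a fraction of the previous step, followed by adding the step to the parameter) to a scalar parameter. The quadratic log-likelihood, the learning rate and the number of steps are hypothetical choices for illustration, and the gradient is supplied analytically rather than by BPTT.

```python
def momentum_ascent(grad, theta, alpha=0.1, eta=0.9, steps=500):
    """Maximize a log-likelihood with a momentum term.

    grad:  function returning d(ln L)/d(theta) at theta
    alpha: learning rate, eta: momentum parameter
    """
    delta = 0.0
    for _ in range(steps):
        delta = alpha * grad(theta) + eta * delta   # step with momentum
        theta = theta + delta                       # parameter update
    return theta

# Toy log-likelihood ln L = -(theta - 3)^2 with gradient -2 (theta - 3).
theta_star = momentum_ascent(lambda th: -2.0 * (th - 3.0), theta=0.0)
```

The parameter oscillates around the maximum at 3 with shrinking amplitude, which is the typical behaviour of a momentum term on a smooth objective.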

We have explained the case in which the training data set $D$ is a single sequence, but the method
can be easily extended to learning of several sequences by using the sum of the gradients for each
sequence. When several sequences are used as training data, initial states $u_0^{(i)}$ and $u_0^g$, and an estimate
$\hat{g}_n$ of the gate opening vector have to be provided for each sequence.

2.1.4 Acceleration of gating network learning 

In principle, our training procedure is defined by equations (19) and (20). However, for reasons which
will be explained in section 4.3, learning for a gating network sometimes becomes unstable, and so the
likelihood $L(D^g, \theta^g)$ often decreases when updating the parameter $\theta^g$ with certain learning rates. Of
course, if we use sufficiently small learning rates, the likelihood does not decrease at any learning step;
however, this requires considerable computation time. Hence, in order to accelerate gating network
learning in practice, we update the learning rate adaptively by means of the following algorithm.

(1) For each learning step, updated parameters are computed by equations (19) and (20) applying
the learning rate $\alpha$, and the rate $r$ defined by



(26) 



^KL((fl'„)Llll(9„)Ll) 

is also computed, where and {gn)n=i are sequences of gate opening values corresponding 

to the current parameters and the updated parameters, respectively. 

(2) If $r > r_{th}$, then $\alpha$ is replaced with $\alpha\,\alpha_{dec}$ and we return to (1). Otherwise go to (3).

(3) If $r < 1$, then $\alpha$ is replaced with $\alpha\,\alpha_{inc}$. Go to the next learning step.

In this study, we use $r_{th} = 1.1$, $\alpha_{dec} = 0.7$ and $\alpha_{inc} = 1.05$.
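The three steps above can be sketched as a small routine. This is an illustrative sketch under a stated assumption: the rate $r$ is taken to be the ratio of the divergence after the candidate update to the divergence before it, which is the reading suggested by the thresholds 1.1 and 1. The callables `propose` and `error` and the quadratic toy problem are hypothetical stand-ins for the gating network update and its Kullback-Leibler divergence.

```python
def adaptive_step(theta, alpha, propose, error,
                  r_th=1.1, a_dec=0.7, a_inc=1.05):
    """One learning step with an adaptively adjusted learning rate.

    propose(theta, alpha) -> candidate parameters for this step
    error(theta)          -> divergence to be reduced (must be positive)
    Returns the accepted parameters and the learning rate for the next step.
    """
    current = error(theta)
    while True:
        candidate = propose(theta, alpha)     # step (1)
        r = error(candidate) / current        # assumed definition of r
        if r > r_th:                          # step (2): shrink and retry
            alpha *= a_dec
            continue
        if r < 1.0:                           # step (3): error went down
            alpha *= a_inc
        return candidate, alpha

# Toy problem: error(theta) = theta^2, plain gradient step as the proposal.
error = lambda th: th * th
propose = lambda th, a: th - a * 2.0 * th
new_theta, new_alpha = adaptive_step(1.0, 1.5, propose, error)
```

Starting from a deliberately large rate of 1.5, the routine rejects two candidate updates, shrinking the rate to 0.735, accepts the third, and then enlarges the rate slightly for the next step.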
2.1.5 Initialization 

Here, we describe the policy used in this paper for initializing parameters and setting the learning rate.
Every element of the matrices and thresholds of a network, either an expert or a gating network, is
initialized by choosing randomly from the uniform distribution on the interval $[-\frac{1}{M}, \frac{1}{M}]$, where $M$
is the number of context neurons. Initial states $u_0^{(i)}$ and $u_0^g$ are also initialized randomly from the
interval $[-1, 1]$. The parameters $(\beta_n)_{n=1}^{T}$ and $\sigma_i$ are initialized such that $\beta_n = 0$ for each $n$ and $\sigma_i = 1$,
respectively. Since the maximum value of $L(D, \theta)$ depends on the total length $T$ of training sequences
and the dimension $d$ of output neurons, we scale the learning rate $\alpha$ by a parameter satisfying

2.2 Feedback loop with time delay 

Let us consider the case in which there exists a feedback loop from output to input with a time delay
$\tau$, that is to say, the model is an autonomous system. In this case, a training data set $D = (X, Y)$
has to satisfy $y_n = x_{n+\tau}$. At the end of learning, if every output of the model is exactly equal
to the training data, the model can generate the sequence $Y$ using a feedback loop without external
input data $X$. In this paper, the dynamics of the model with external input is referred to as open-loop
dynamics, and the dynamics with self-feedback is referred to as closed-loop dynamics. In section 3, we
consider the case in which the model is an autonomous system.
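The relation between open-loop training data and closed-loop generation can be sketched as follows. The helper names and the toy predictor are hypothetical; the predictor stands in for a trained model that maps an input to the value occurring $\tau$ steps later, and $\tau \geq 1$ is assumed.

```python
def make_training_pairs(series, tau):
    """Build inputs x_n and targets y_n = x_{n+tau} from one series (tau >= 1)."""
    inputs = series[:-tau]
    targets = series[tau:]
    return inputs, targets

def closed_loop(predict, seed, steps):
    """Generate a sequence by feeding each output back as a later input.

    seed:    the first tau values, which fix the delay tau = len(seed)
    predict: maps x_n to an estimate of x_{n+tau}
    """
    xs = list(seed)
    for n in range(steps):
        xs.append(predict(xs[n]))   # xs[n + tau] = predict(xs[n])
    return xs

series = [0, 1, 2, 3] * 3                        # toy periodic sequence
inputs, targets = make_training_pairs(series, tau=2)
generated = closed_loop(lambda x: (x + 2) % 4, seed=[0, 1], steps=6)
```

In open-loop operation the predictor is always given the recorded input, whereas in closed-loop operation it runs on its own past outputs, so any prediction error is fed back into later steps.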



3 Numerical simulation 

We now demonstrate how the mixture of RNN experts learns to generate sequences given as tasks. 
Our first experiment in section 3.1 is to learn a stochastic finite state automaton by means of a gating
network, in order to test whether a gating network can output an estimate of the gate opening vector 
even if switching of winners is stochastic. We include this test for two reasons. (1) Since a time series 
of gate opening vectors represents an arbitrary composition of primitives, such a time series is usually 
a symbolic and nondeterministic sequence. (2) It has been shown that the simple recurrent network 
(SRN or Elman net) can learn symbolic and nondeterministic sequences such as declarative sentences 
[3l S] . The SRN has typically been used to learn to predict the probability of occurrence of the next 
symbol, and the output values of the SRN represented this probability. For example, if each symbol 
is to appear next with equal probability, the output values of the SRN are constant, summing to one. 
However, in the context of the mixture of RNN experts model, the weighted sum of output values of 
all experts by probability does not represent a correct value. In order to output the correct value, the 
gating network has to determine a winner among the experts, and the output of the gating network 
corresponding to the winner should be almost one. This means that the gating network has to learn to 
output not the probability but stochastic variables instead. It is not trivial to determine whether the 
gating network can learn this task, because the network is a deterministic dynamical system. At least, 
if the maximum Lyapunov exponent of the network is a negative value, the output of the network 
is periodic and so cannot represent any nondeterministic sequences. Basically, a necessary condition 
to output nondeterministic sequences using a deterministic dynamical system is that the maximum 
Lyapunov exponent is positive. 
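The role of the sign of the maximum Lyapunov exponent can be illustrated with a one-dimensional map instead of an RNN. The sketch below is an illustration only and is not the estimation procedure used for the networks in this paper: it averages the logarithm of the absolute derivative along an orbit of the logistic map, a standard textbook system, for one periodic and one chaotic parameter value.

```python
import math

def lyapunov_1d(f, df, x0, n_steps=20000, n_transient=1000):
    """Estimate the Lyapunov exponent of a one-dimensional map.

    f is the map, df its derivative; the exponent is the orbit average
    of ln |df(x)| after discarding an initial transient.
    """
    x = x0
    for _ in range(n_transient):
        x = f(x)
    total = 0.0
    for _ in range(n_steps):
        total += math.log(max(abs(df(x)), 1e-300))  # guard against log(0)
        x = f(x)
    return total / n_steps

def logistic(r):
    """Logistic map x -> r x (1 - x) and its derivative."""
    return (lambda x: r * x * (1.0 - x)), (lambda x: r * (1.0 - 2.0 * x))

lam_periodic = lyapunov_1d(*logistic(3.2), x0=0.3)   # period-2 orbit
lam_chaotic = lyapunov_1d(*logistic(3.9), x0=0.3)    # chaotic orbit
```

The periodic case yields a negative exponent and an output that simply repeats, while the chaotic case yields a positive exponent, which is the regime in which a deterministic system can produce sequences that look nondeterministic.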

On the other hand, it is known that there are chaotic dynamics which resemble nondeterministic 
sequences, because it is possible to create a correspondence between a hyperbolic dynamical system and 
a symbolic dynamics by constructing a Markov partition [14]. Tani et al. [17, 22] demonstrated that a
nondeterministic symbolic process can be imitated by an RNN trained by the gradient descent method, 
utilizing initial sensitivity characteristics of a chaotic dynamics. However, they only investigated the 
case in which the length of training data is about 10. Thus, in our first experiment we show that the 
gating network can acquire a chaotic dynamics and can generate nondeterministic sequences, even if 
training data are much longer. 

Our second experiment in section 3.2 is to learn Markov chains whose primitives are Lissajous
curve patterns. It was shown that a mixture of RNN experts can extract a set of reusable primitives
from training data generated by a Markov chain of Lissajous curves [16], but learning for the gating
network was not considered. Therefore, we will show that a mixture of RNN experts can generate 
stochastically combined Lissajous curves by switching winners using a gating network. 

In order to evaluate the learning capability of a mixture of RNN experts, we first define some 
measures of fitness. An error E for expert networks is defined by the mean square error 



for each learning step, where y'f'"^'^ denotes the training data, and y„ = 9n Un ■ An error 

for the gating network is defined by the Kullback-Leibler divergence




(27) 
where we assume that $E$ and $E^g$ are computed by teacher forcing, i.e., utilizing the training data
and estimates of gate opening values as network input. If $E = E^g = 0$, then the mixture of RNN
experts can output sequences equivalent to the training data. Since $E$ and $E^g$ are measures for open-loop
dynamics, we next define the Kullback-Leibler divergence to evaluate fitness in the closed-loop
dynamics.

$D^{m}_{KL}\left((\hat{s}_n)_{n=1}^{T} \,\|\, (s_n)_{n=1}^{T}\right) = \sum_{s \in \{1, \ldots, N\}^{m}} P\left(s \mid (\hat{s}_n)_{n=1}^{T}\right) \ln \frac{P\left(s \mid (\hat{s}_n)_{n=1}^{T}\right)}{P\left(s \mid (s_n)_{n=1}^{T}\right)}$, (29)

where $s_n \in \{1, \ldots, N\}$ denotes the index of a current opening gate defined by

$s_n = \arg\max_{i} g_n^{(i)}$, (30)

and $\hat{s}_n$ is given by equation (30) with respect to $\hat{g}_n$. By using this measure, we can perform an
evaluation even if the training data are stochastic time series combining several primitive patterns.
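A closed-loop fitness measure of this kind can be sketched by comparing the empirical distributions of length-$m$ blocks of winner indices in two sequences. The sketch below makes two assumptions that are not stated in the text: the block probabilities are estimated by relative frequencies over sliding windows, and blocks that never occur in the generated sequence are given a small floor probability so that the logarithm stays finite. Indices are 0-based here, whereas the text counts gates from 1.

```python
import math
from collections import Counter

def winner_indices(gates):
    """Index of the largest gate opening value at each time step."""
    return [max(range(len(g)), key=lambda i: g[i]) for g in gates]

def block_distribution(symbols, m):
    """Relative frequency of every length-m block in a symbol sequence."""
    counts = Counter(tuple(symbols[k:k + m])
                     for k in range(len(symbols) - m + 1))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()}

def block_kl(reference, generated, m, floor=1e-6):
    """Divergence from the reference block distribution to the generated one."""
    p = block_distribution(reference, m)
    q = block_distribution(generated, m)
    return sum(pv * math.log(pv / q.get(block, floor))
               for block, pv in p.items())

winners = winner_indices([[0.1, 0.9], [0.8, 0.2]])   # hypothetical gate values
ref = [0, 1, 0, 1, 0, 1]                             # hypothetical target winners
gen = [0, 0, 0, 1, 0, 0]                             # hypothetical generated winners
d1 = block_kl(ref, gen, m=1)
```

Because the measure compares block statistics rather than the sequences step by step, it remains meaningful when the target sequence is stochastic and cannot be reproduced exactly.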

3.1 Preliminary experiment: Learning a stochastic finite state automaton 

Here, we investigate how a gating network learns time series of gate opening values which form
stochastic sequences. As we focus on training of the gating network, we assume that the dimension of
the input $x_n$ is $d = 0$. In other words, any effects of experts and the input-output sequence $D = (X, Y)$
are omitted. As a benchmark system, we use the finite state automaton described in Figure 2. This
automaton is given by modifying the Reber grammar to an ergodic model. When we generate a
sequence using the automaton, we choose a path randomly from two paths with equal probability.
Symbols are represented by gate opening values as follows: each symbol is mapped with a one-to-one
correspondence to an integer from 1 to 7, and if a symbol corresponding to $i$ appears at time
$n$, then $g_n^{(i)} = 1$.

As an example, we computed learning for the gating network up to 10000 steps. Here we use 4
training sequences of length $T = 500$ (so the total length of training sequences is 2000). We used 20
context neurons, a time constant of 0.1, a feedback time delay $\tau = 1$, a learning rate $\alpha = 0.01$ and
a momentum $\eta = 0.9$. In Figure 3, we depict the
error, Kullback-Leibler divergence and maximum Lyapunov exponent for each learning step, where
the Lyapunov exponent is computed for the closed-loop dynamics. Here, we computed the Kullback-Leibler
divergence with $m = 3$, because the stochastic process defined by Figure 2 is a second order
Markov chain. Since the maximum Lyapunov exponent becomes positive, the gating network acquires
chaotic dynamics through training. Moreover, acquiring the chaotic dynamics is effective for imitation
of the stochastic process, because the Kullback-Leibler divergence decreases as the
maximum Lyapunov exponent increases. Figure 4 displays the dynamics of the trained network with a random
initial state. It can be seen that the output of the network is almost always 0 or 1. This result indicates
that the network learned to output not the probability but stochastic variables. Furthermore, there are
cases in which the output of the network does not satisfy the syntax of the Reber grammar (printed
as red symbols over the gate opening values in Figure 4), but the dynamics recovers the correct
output patterns.

Table 1 shows the results of training for each number of context neurons. For each case, we
computed 50 samples up to 10000 learning steps with different initial conditions, where we use the same
parameters as for the above training example except for the number of context neurons. In addition, in
order to evaluate the generalization capability of trained networks, we prepared test data separate from
the training data, and we computed the Kullback-Leibler divergence $D^{m}_{KL}((\hat{s}_n)_{n=1}^{T} \| (s_n)_{n=1}^{T})$ for the
test data with random initial states. It can be seen that if there exist sufficient context neurons to learn
the training data, the dynamics of some trained networks become chaotic and the Kullback-Leibler
divergences between sequences generated by the networks and training/test data become sufficiently
small.
Figure 2: Finite state automaton based on the Reber grammar. If there are two paths, then we
randomly choose one with equal probability.

Figure 3: (a) The error $E^g$. (b) Kullback-Leibler divergence $D^{m}_{KL}((\hat{s}_n)_{n=1}^{T} \| (s_n)_{n=1}^{T})$. (c) Maximum
Lyapunov exponent.

Figure 4: A snapshot of the gate opening values of the trained network with closed-loop dynamics. Each label
over gate opening values denotes a symbol corresponding to the current opening gate, and a red label
means that the symbol does not satisfy the syntax of the Reber grammar.



Table 1: The results of training (mean of 50 trials). A percentage in round brackets represents the ratio
of trained networks whose maximum Lyapunov exponent is positive. When we evaluated a Lyapunov
exponent, we computed 200 sample sequences of 10,000 time steps for each trained network.

Number of         Error      Kullback-Leibler divergence       Lyapunov
context neurons   E^g        (training data)   (test data)     exponent
10                0.144868   2.511791          2.388665        0.203512 (88%)
20                0.010118   0.258037          0.309089        0.215117 (98%)
30                0.005721   0.051233          0.327273        0.160728 (100%)
50                0.004339   0.006338          0.290858        0.116574 (100%)

Figure 5: (a) Transition rule of the training data. Each figure represents the trajectory of a Lissajous
curve, and P denotes the transition probability between curves. (b) An example of the training data.



3.2 Learning stochastic switching of Lissajous curves 

In the current section, we consider two tasks, which are to generate time series stochastically combining
Lissajous curves. Assume that for each task, transitions among curves are consistent with orbit
continuity. Since we consider a model which has a feedback loop with time delay $\tau$, every training
data set $D = (X, Y)$ satisfies $y_n = x_{n+\tau}$.

3.2.1 Two Lissajous curves case 

The first task is to learn 2-dimensional sequences generated by Markov chain switching of 2 Lissajous
curves of period 25 (see Figure 5). We use 10 training sequences each of length $T = 1000$ and with
transition probability $P = 0.5$. The time delay of the feedback loop is $\tau = 1$. The number of experts
is $N = 2$. There are 10 context neurons for each expert and 15 for the gating network. The parameter
settings are e = 0.2, = 0.04, a = 0.05, <^ = 10, 5 = 0.01 and t] = 0.9. 

Figure 6 depicts an example of results for training the mixture of RNN experts with this task.
Here, we computed learning for experts up to 30000 steps, and learning for the gating network up to
10000 steps. Figures 7 and 8 depict trajectories of the trained network computed from the closed-loop
dynamics. Figure 7 indicates that the trained network can generate trajectories similar to the training
data. Moreover, from Figure 6 (d) and Figure 8 it can be observed that the network has acquired
stochastic switching between Lissajous curves using chaotic dynamics.

In Table 2 and Figure 9, to show that the model can acquire probabilities of transition, we compare
cases in which the training data are generated by different transition probabilities. Here, we present
evaluated results of 30 samples for each case with the same parameters as for the previous training.
Table 2 describes transition probabilities between curves and maximum Lyapunov exponents, and
Figure 9 describes probability distributions for the number of repetitions of each curve. These
results indicate that the mixture of RNN experts can approximately acquire the transition probabilities of the task.
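The distribution of repetition counts can be illustrated with a small simulation. The sketch below assumes that, at the end of each cycle, the generator switches to the other curve with probability $P$ and otherwise repeats the current curve; under that assumption the number of consecutive repetitions follows a geometric distribution with mean $1/P$. The function names, the seed and the sequence length are hypothetical choices, and the curves themselves are represented only by their indices.

```python
import random

def switching_sequence(p_switch, n_cycles, rng):
    """Index (0 or 1) of the curve traced in each cycle."""
    curve, seq = 0, []
    for _ in range(n_cycles):
        seq.append(curve)
        if rng.random() < p_switch:   # switch curves after this cycle
            curve = 1 - curve
    return seq

def run_lengths(seq):
    """Lengths of consecutive repetitions; the last unfinished run is dropped."""
    runs, length = [], 1
    for prev, cur in zip(seq, seq[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    return runs

def geometric_pmf(k, p_switch):
    """Probability of exactly k repetitions before a switch (assumed model)."""
    return (1.0 - p_switch) ** (k - 1) * p_switch

rng = random.Random(0)                       # fixed seed for reproducibility
seq = switching_sequence(0.25, 20000, rng)
runs = run_lengths(seq)
mean_run = sum(runs) / len(runs)             # should be close to 1 / 0.25
```

Comparing a histogram of `runs` from a trained network with `geometric_pmf` is one way to check whether the generated switching matches the statistics of the training data.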
Figure 6: (a) The error of experts $E$. (b) The error of the gating network $E^g$. (c) The Kullback-Leibler
divergence $D^{m}_{KL}((\hat{s}_n)_{n=1}^{T} \| (s_n)_{n=1}^{T})$. (d) Maximum Lyapunov exponent.

Figure 7: (a) Teaching data. (b) A trajectory of the mixture of RNN experts with closed-loop dynamics
(plotted over 10,000 time steps after convergence of the transients).

Table 2: The results of training (mean of 30 trials). Here, a probability refers to a transition probability
between curves. When we evaluated the probability and the Lyapunov exponent, we computed 50
sample sequences of 100,000 time steps for each network.

Probability   Probability         Lyapunov
(task)        (trained network)   exponent
0.5           0.451033            0.001076 (93%)
0.25          0.275999            0.002080 (96%)
0.125         0.150866            0.002829 (96%)

Figure 9: Distributions of the number of repetitions of each Lissajous curve, for each transition
probability P of training data. (a) P = 0.5, (b) P = 0.25, (c) P = 0.125. Here, a histogram describes
an averaged distribution of a whole set of trained networks, and a line describes the theoretical
distribution of training data. When we evaluated the distributions, we computed 50 sample sequences
of 100,000 time steps for each network.



3.2.2 Multiple Lissajous curves case 

The second task is to learn time series defined by switching among 9 Lissajous curves of period 32
(see Figure 10). In the task, we use a training sequence of length $T = 10000$ with a time delay of
feedback $\tau = 5$. We have $N = 24$ experts, each with 10 context neurons, and the gating network has
30 context neurons.

In Figure 11, we display an example of training, where parameters were set as follows: e = 0.1,
= 0.04, a = 0.05, <i — 1, a — 0.01 and rj — 0.9. It can be seen that the maximum Lyapunov exponent 
oscillates between positive and negative values after 1000 learning steps; concurrently the Kullback-Leibler
divergence changes drastically. This phenomenon implies that properties of the network as a
dynamical system are changed by updating parameters, and the network is not structurally stable.
In this case, stability of the likelihood (or the error) during learning does not imply stability of
the properties of the model. Indeed, in Figure 11 the error decreases monotonically, whereas the
maximum Lyapunov exponent oscillates. In general, since the topological properties of a dynamical
system that is not structurally stable are changed by perturbations [T], it is difficult to prevent oscillation
of the maximum Lyapunov exponent even when the learning rate is very small. In section 4.3, we will
explain and discuss this problem of the structural stability of learning models in detail.

Figures 12 and 13 represent closed-loop dynamics of the trained network with a random initial
state, where the network has a positive Lyapunov exponent. It can be seen in Figure 12 (b) that there
are two groups of experts; one group represents primitives, and the other represents transition paths
between primitives. Furthermore, taking into account the whole set of transition paths, the transition
rule for the network described in Figure 12 (b) is similar to that for the training data described in
Figure 10 (b).
Figure 10: (a) Training data generated by switching among 9 Lissajous curves. (b) Each Lissajous
curve. Transition among curves is consistent with continuity of the orbit. Keeping a current curve or
moving to another curve is chosen randomly with equal probability.

Figure 11: (a) The error of the experts $E$. (b) The error of the gating network $E^g$. (c) Kullback-Leibler
divergence $D^{m}_{KL}((\hat{s}_n)_{n=1}^{T} \| (s_n)_{n=1}^{T})$. (d) Maximum Lyapunov exponent. (Horizontal axis: learning step.)

Figure 12: (a) A trajectory generated by the trained network in the closed-loop dynamics (plotted
over 10,000 time steps after convergence of the transients). (b) Output of experts and transition
paths. The output of an expert $i$ is plotted if $s_n = i$, namely, if gate $i$ opens at time $n$. If gate $i$ is
never opened, then no output of the expert $i$ is plotted. The graphs enclosed by blue boxes represent
Lissajous curves as primitives, and other graphs represent paths connecting curves.

Figure 13: A snapshot of output $y_n$, gate opening vector $g_n$ and index $s_n$ of a current opening gate
of the trained network.
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3.3 Summary of results 

In the preliminary experiment, we tested the capability of the gating network to learn, by means of 
chaotic dynamics, stochastic sequences given by a finite state automaton. The result shows that, by 
increasing the number of context neurons, the Kullback-Leibler divergence can be reduced and the 
frequency of appearance of chaos also increases. 

For the next experiment, we considered the task of learning stochastic time series consisting of 
Lissajous curves. In the case of two Lissajous curves, a mixture of RNN experts can imitate by 
chaos not only the nondeterministic characteristics but also the probability distribution underlying 
the training data. Moreover, the networks can stably maintain chaotic dynamics at the end of learning. 
In the case of multiple Lissajous curves, the trained network can generate sequences similar to the 
training data, though the maximum Lyapunov exponent is not always positive in the last learning phase. 
Since this instability of chaos is due to the structurally unstable dynamics of the network, we 
will discuss this problem of structural stability in section 4.3. 

4 Analysis and Discussion 

In this paper, we have investigated the learning process of a gating network contained in the mixture 
of RNN experts model. In particular, we have focused on how the gating network learns, by 
self-organizing chaos, stochastic sequences produced by an arbitrary composition of primitives. The 
fact that the gating network can learn stochastic time series is important, because it had been 
considered impossible for RNNs, as deterministic dynamical systems, to learn nondeterministic time series. 

In the following we discuss some general characteristics in the learning of stochastic time series by 
embedding into chaos. We focus on two essential problems, related to memory capacity and to the 
structural stability of RNNs, and explain how the problems can be overcome. 

4.1 A problem in learning stochastic time series by means of a deterministic but chaotic dynamics 

Here, we explain a problem that appears in particular when stochastic time series are learned by a 
dynamical system. To examine the problem, we consider the difficulty of learning of time series by 
RNNs for several categories of target training data sequences. 

Firstly, we consider the case in which training data are given by deterministic dynamical systems, 
including chaotic ones, in contrast with the case of nondeterministic time series. If the training 
data contain all degrees of freedom, then future states are uniquely determined by the current state. 
Thus, the task of generating the time series is equivalent to the approximation of an input-output 
function, and successful learning does not require the model to have internal states with recurrent 
connections. On the other hand, if the training data contain only a subset of the degrees of freedom, 
then future states cannot be determined uniquely by the current state. However, by Takens' embedding 
theorem, the training data can often be embedded in another Euclidean space, in which the time 
evolution is deterministic, by using the method of delays. Accordingly, even if training data realize 
only a subset of the degrees of freedom, we can still reduce the task to the approximation of an 
input-output function by reconstructing the state space. From this fact, the training data can be 
learned by a model which has a contraction mapping with respect to internal states (e.g., an echo 
state network), because the contraction mapping plays the role of embedding past time series in the 
internal states. The successful learning of Mackey-Glass chaotic time series by an echo state network 
is a remarkable example [10]. 
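
The reduction described above, from a partially observed deterministic system to an input-output 
function on delay vectors, can be illustrated with a minimal sketch. It assumes the Henon map as the 
deterministic system (chosen here only for illustration; it is not a system used in this study) and 
observes only its x coordinate. The helper names `henon_x` and `delay_embed` are hypothetical. 

```python
def henon_x(n, a=1.4, b=0.3):
    """Iterate the two-dimensional Henon map but record only the x coordinate."""
    x, y = 0.0, 0.0
    xs = []
    for _ in range(n):
        xs.append(x)
        x, y = 1.0 - a * x * x + y, b * x  # both updates use the old x
    return xs


def delay_embed(series, dim, lag=1):
    """Method of delays: vectors (s[t], s[t-lag], ..., s[t-(dim-1)*lag])."""
    start = (dim - 1) * lag
    return [tuple(series[t - k * lag] for k in range(dim))
            for t in range(start, len(series))]


A, B = 1.4, 0.3
xs = henon_x(2000, A, B)
vectors = delay_embed(xs, dim=2)

# vectors[i] = (x[i+1], x[i]); the next observation x[i+2] is a function of it.
max_err = max(abs(xs[i + 2] - (1.0 - A * v[0] * v[0] + B * v[1]))
              for i, v in enumerate(vectors[:-1]))
print("largest one-step prediction error from delay vectors:", max_err)
```

A single observation x[n] does not determine x[n+1], because the hidden coordinate y[n] is missing, 
whereas the two-dimensional delay vector does. This is the sense in which the task becomes the 
approximation of an input-output function. 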

By way of contrast, nondeterministic sequences generated by a stochastic process cannot be 
transformed into deterministic time series by reconstructing the state space with the method of delays. 
Hence, training data consisting of nondeterministic sequences cannot be learned in the same way as 
deterministic sequences. One way to output nondeterministic sequences using a dynamical system model 
is for the model to exhibit chaotic dynamics with respect to its internal states, and for the output 
of the model to consist of a readout of the internal states. If encoded information about the 
nondeterministic sequences can be embedded in the initial states with enough precision to predict 
future values, the nondeterminacy of the sequences can be imitated by "sensitive dependence on initial 
conditions", which was discovered by Edward Lorenz in the early 1960s. For instance, RNNs utilizing 
sensitive dependence on initial conditions can learn nondeterministic sequences (e.g., the 
experimental study in [22] or our simulation in section 3). Learning for such models is, however, often 
unstable, for the following reason. In order to train such models, the set of learnable parameters has 
to include the initial states, because information about the sequences is embedded in the initial 
states. If a trained model is governed by chaotic dynamics, almost any minute change of an initial 
state brings about a drastic change in the output sequence, due to the initial sensitivity of the 
chaotic dynamics. Hence this initial sensitivity introduces instability into the learning process. 
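
Both points of this paragraph can be seen in a small sketch. It assumes the logistic map at r = 4 as 
a stand-in chaotic system and a two-symbol threshold readout; neither is the model of this paper, and 
the names `logistic_orbit` and `symbols` are hypothetical. 

```python
def logistic_orbit(x0, n, r=4.0):
    """Orbit of the logistic map x -> r x (1 - x), which is chaotic at r = 4."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def symbols(xs, threshold=0.5):
    """Readout of the state: 'L' below the threshold, 'R' otherwise."""
    return "".join("L" if x < threshold else "R" for x in xs)


# Two initial states that differ by 1e-12 ...
a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2 + 1e-12, 100)
gap = [abs(u - v) for u, v in zip(a, b)]
print("initial gap:", gap[0], " largest gap within 100 steps:", max(gap))

# ... give the same symbols at first and different ones later.
sa, sb = symbols(a), symbols(b)
first_diff = next((i for i, (p, q) in enumerate(zip(sa, sb)) if p != q), None)
print("first step at which the symbol sequences differ:", first_diff)

# Over a long orbit the readout behaves much like a fair coin.
long_symbols = symbols(logistic_orbit(0.2, 10000))
freq_R = long_symbols.count("R") / len(long_symbols)
print("relative frequency of R:", freq_R)
```

The initial state thus acts as a code for the whole symbol sequence, which is what allows a 
deterministic model to imitate a stochastic source. The same property makes the output extremely 
sensitive to any change of the initial state during learning. 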

4.2 Memory capacity of initial states for embedding sequence information 

As described in section 4.1, in order for a dynamical systems model utilizing sensitive dependence 
on initial conditions to successfully learn nondeterministic sequences, it has to be possible to embed 
information about the sequences in the initial states with enough precision to predict future values. 
However, the fact that the initial states contain enough information to enable prediction is not a 
sufficient condition for successful learning, because trained models may memorize all the training 
data exactly without acquiring any chaotic dynamics. We intuitively expect that a sufficient amount 
of training data will be necessary to force a more general model, i.e., one that has acquired 
adequate chaotic dynamics. We will discuss conditions for obtaining chaotic dynamics, taking into 
account both the amount of training data and the memory capacity of internal states. 

To examine conditions for obtaining chaotic dynamics by learning for a mixture of RNN experts, 
we present additional simulation results. In the following, we refer to the number of training 
sequences, the step length of each training sequence and the number of context neurons in a gating 
network as NT, LT, and NC, respectively. Here, we repeat the learning simulation of the task in 
section 3.2 with the transition probability P set to 0.5, while varying NT, LT and NC. In addition, 
in order to measure the memory capacity of initial states, we define an error $E^{\mathrm{closed}}$ 
for closed-loop dynamics, which measures the deviation between the training data and the output $y_n$ 
of the model computed in closed-loop dynamics from a trained initial state. If $E^{\mathrm{closed}}$ 
becomes sufficiently small, the mixture of RNN 
experts can output sequences almost identical to the training data; that is, the training data are 
embedded in the initial states. Figure 14 shows the results of training for each setting of the 
number of sequences NT, their step length LT and the number of context neurons NC. In this figure, we 
display the average of the error $E^{\mathrm{closed}}$ and the ratio of networks whose maximum 
Lyapunov exponent is positive, over 10 samples for each parameter setting. The results in 
Figure 14(a) clearly indicate that the error $E^{\mathrm{closed}}$ depends on both NT and LT, and that 
it decreases with increasing NC. In particular, $E^{\mathrm{closed}}$ increases if NT and LT are 
increased in the case of NC = 10. This result is due to the fact that errors cannot be corrected when 
the memory capacity of the initial states is too small. The results in Figure 14(b) indicate that 
chaos tends to appear if both NT and LT are increased. In addition, as NC increases, a greater amount 
of training data tends to be required to generate chaos. It can be inferred that overabundant memory 
capacity enables the networks to memorize all the training data exactly without extracting adequate 
chaotic dynamics from the data. Thus, provided that a network has sufficient memory capacity to embed 
the training sequences in initial states, a small number of context neurons tends to generalize 
adequate chaotic dynamics from the training data. On the other hand, in the case of 10 context 
neurons, the trained networks did not acquire chaotic dynamics even when given a great deal of 
training data. This result implies that acquiring chaos tends to fail if the memory capacity of a 
network is insufficient to embed the training sequences. If a task demands a large amount of training 
data, a sufficient number of context neurons is necessary to successfully learn the task. Therefore, 
these results imply that it is necessary to balance a sufficient amount of training data with the 
memory capacity in order to embed the target stochastic sequences into chaos. 

Figure 14: Phase plots of the error $E^{\mathrm{closed}}$ and the ratio of networks whose maximum 
Lyapunov exponent is positive. The upper figures (a) display the error $E^{\mathrm{closed}}$, and the 
lower figures (b) display the ratio of positive Lyapunov exponent. In the figures, NC denotes the 
number of context neurons in a gating network, the vertical axis is the length of each training 
sequence LT, and the horizontal axis is the number of training sequences NT. 
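
How a closed-loop error of this kind can be computed is sketched below. The normalization (mean 
squared deviation per time step) is an assumption, since the defining equation is not reproduced 
here, and `toy_step` is a hypothetical stand-in for a trained network, not the mixture of RNN experts. 

```python
import math


def closed_loop_rollout(step, state0, y0, length):
    """Run a model in closed loop: each output is fed back as the next input."""
    state, y, outputs = state0, y0, []
    for _ in range(length):
        state, y = step(state, y)
        outputs.append(y)
    return outputs


def closed_loop_error(targets, outputs):
    """Mean squared deviation per time step (the normalization is an assumption)."""
    if len(targets) != len(outputs):
        raise ValueError("sequences must have equal length")
    total = sum((t - o) ** 2
                for tv, ov in zip(targets, outputs)
                for t, o in zip(tv, ov))
    return total / len(targets)


def toy_step(state, y, w=0.3):
    """Hypothetical stand-in model: y[n+1] = 2 cos(w) y[n] - y[n-1]."""
    return y[0], (2.0 * math.cos(w) * y[0] - state,)


# "Training data" generated from the true initial state (y[-1], y[0]).
true_state, y0 = math.cos(-0.3), (1.0,)
targets = closed_loop_rollout(toy_step, true_state, y0, 200)

err_trained = closed_loop_error(
    targets, closed_loop_rollout(toy_step, true_state, y0, 200))
err_wrong = closed_loop_error(
    targets, closed_loop_rollout(toy_step, 0.9, y0, 200))
print(err_trained, err_wrong)
```

In this toy setting the error vanishes only when the initial state carries the right information 
about the sequence, which is the property the closed-loop error is used to probe. 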

Finally, we explain why long sequences can be encoded in initial states in our model. It is known 
that the BPTT method cannot effectively propagate error signals over long time windows, because the 
information in the error signals decreases exponentially through the sigmoid function [2]. For 
instance, in the studies [17, 22] all training data learned by RNNs were of length less than 100. Our 
model, however, successfully learns training data whose length is over 10,000. The difference between 
the situation in the previous studies and ours is brought about by the time constant in equation (5) 
for updating the context neural states. Since the internal potential of a context neuron decays slowly 
owing to the time constant, our model can retain the values of context neurons for a long time. The 
internal potentials of context neurons with a time constant play a role similar to that of the linear 
integrators within the special units of LSTM [7, 21]. While units of LSTM can retain values without 
any damping, context neurons with a time constant lose information as their states are updated. 
Conversely, since the internal potentials of context neurons with a time constant are bounded by 
damping, the potential values do not diverge to infinity, and so the dynamics of our model is more 
stable in many situations. As the potential nonlinearity of chaotic dynamics often forces divergence 
on a learning model, our model is desirable when chaotic dynamics is to be learned. 
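
The trade-off between retention and boundedness can be illustrated with a generic leaky integrator. 
Equation (5) is not reproduced in this part of the text, so the update rule below is an assumed 
standard form rather than the exact rule of the model, and `eps` is a hypothetical name for the time 
constant. 

```python
def leaky_update(u, drive, eps):
    """One step of a leaky integrator: u <- (1 - eps) * u + eps * drive."""
    return (1.0 - eps) * u + eps * drive


def retained_after(steps, eps, u0=1.0):
    """Value left from u0 after `steps` updates with zero drive."""
    u = u0
    for _ in range(steps):
        u = leaky_update(u, 0.0, eps)
    return u


slow = retained_after(100, eps=0.01)   # small eps: slow decay, long memory
fast = retained_after(100, eps=0.5)    # large eps: the value is forgotten quickly
print(slow, fast)

# With a drive confined to [-1, 1], the potential never leaves [-1, 1]
# when it starts there, because each update is a convex combination.
u, peak = 0.0, 0.0
for n in range(1000):
    u = leaky_update(u, 1.0 if n % 7 < 4 else -1.0, 0.01)
    peak = max(peak, abs(u))
print(peak)
```

With `eps = 0.01` roughly a third of the initial value survives 100 steps, whereas a pure integrator 
(`eps = 0` on the decay term) would keep everything but could grow without bound under a persistent 
drive. 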

4.3 Structural stability of learning models 

In the above discussion, we demonstrated that the proposed model can have enough memory capacity 
to embed long sequences in initial states. However, as we saw in the results of the experiments, chaos 
does not always appear even if both a sufficient amount of training data and sufficient memory 
capacity are given and the error is made sufficiently small. When learning models have to acquire 
chaotic dynamics, the problem of structural stability of the learning models appears. The notion of 
structural stability was formulated by Andronov and Pontryagin [1] on the basis of the belief that a 
model for a natural phenomenon must preserve at least its topological character under perturbation. 
Mathematically, a dynamical system is called structurally stable if any other dynamical system in its 
neighborhood is topologically conjugate to it. In terms of a learning procedure using the gradient 
descent method, if a model is structurally stable and the learning rate is sufficiently small, the 
model preserves its topological character when parameters are changed in the neighborhood of a local 
minimum. By contrast, when a model is not structurally stable, convergence of the likelihood (or the 
error) does not correspond to convergence of the topological character of the model, and success or 
failure of learning cannot be identified until the time evolution is actually computed. Therefore, 
structural stability of the model is preferable from the viewpoint of learning. However, taking into 
account the whole space of dynamical systems, structurally stable dynamical systems are not dense in 
this space. Hence a neural network, which is a kind of universal function approximator, is not 
structurally stable in general. For example, in the simulation described in Figure 11 the network 
seems not to be structurally stable, while in Figure 3 and Figure 6 the networks might be structurally 
stable at the end of training. If we use a universal approximator as a learning model, it is difficult 
to avoid the occurrence of a model that is not structurally stable. 
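
The way a modest parameter change can flip the sign of the maximum Lyapunov exponent is easy to 
reproduce in a one-dimensional toy system. The sketch below assumes the logistic map as a stand-in 
for a trained network (it is not the model of this paper) and estimates the exponent as the orbit 
average of log|f'(x)|. The function name `lyapunov_logistic` is hypothetical. 

```python
import math


def lyapunov_logistic(r, n=20000, discard=2000, x0=0.3):
    """Average of log|f'(x)| along an orbit of f(x) = r x (1 - x)."""
    x = x0
    for _ in range(discard):               # let transients die out
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))  # guard log(0)
        x = r * x * (1.0 - x)
    return total / n


for r in (3.80, 3.83, 4.00):
    print(r, lyapunov_logistic(r))
```

The exponent is positive at r = 3.80, negative at r = 3.83 (inside the period-3 window) and close to 
log 2 at r = 4. A parameter update of comparable relative size during learning could likewise move a 
network between chaotic and periodic regimes even while the training error keeps decreasing. 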

Let us consider how to cope with the appearance of a model that is not structurally stable. One 
possible way is to add a negligible amount of noise to the dynamics of a trained model. For instance, 
even if the dynamics of a trained model is not exactly chaotic because it is not structurally stable, 
the model may have acquired a dynamical structure whose neighborhood contains chaotic dynamics. In 
this case, by adding low noise, the model can recover chaotic movements by making transverse 
intersections between stable and unstable manifolds. In addition, if the model is structurally 
stable, the topological character does not change under low noise, by virtue of the shadowing 
property [19]. As an example, Figure 15 displays trajectories of a network that was trained in 
section 4.2 and that seems not to be structurally stable. Figure 15(a) shows a trajectory of the 
network after 10,000 learning steps. In this case, the dynamics of the network is a limit cycle and 
the maximum Lyapunov exponent is negative. In Figure 15(b), however, the dynamics of the network has 
changed from a limit cycle to chaos after only 10 further learning steps. Next, we added Gaussian 
white noise with variance 0.0001 to the dynamics of the network having a limit cycle, and plotted a 
trajectory in Figure 15(c). As a result, dynamics similar to that of case (b) is reconstructed by 
adding the noise. This shows that a neighborhood of the network contains a chaotic system, and that 
adding noise restores the network dynamics to the chaotic system while respecting the original topology. 

Figure 15: Trajectories of two values of $u$ of a network with closed-loop dynamics, where the network 
is trained as in section 4.2 with 10 training sequences, each of step length 1000, and 30 context 
neurons. For each panel, we plotted 10,000 time steps after convergence of the transients. (a) A 
trajectory of the network after 10,000 learning steps (the maximum Lyapunov exponent is -0.000054). 
(b) A trajectory of the network after 10,010 learning steps (the maximum Lyapunov exponent is 
0.000995). The dynamics of the network changed from a limit cycle to chaos after only 10 learning 
steps. (c) A trajectory of the same network as in case (a), but with Gaussian white noise with 
variance 0.0001 added to the input $x_n$ at each time step n. Chaotic dynamics appears when the noise 
is added. 

In Figure 16, to evaluate the effect of low noise, we display the ratios of networks generating 
nonperiodic output sequences, where the networks were trained as in section 4.2 and Gaussian white 
noise was added to the input $x_n$ at each time step n. Figures 16(a) and (b) display the results for 
noise variance 0 and 0.001, respectively. Nonperiodicity of output sequences implies that the 
dynamics is chaotic or quasi-periodic. To evaluate periodicity of the output sequences, we labeled the 
left and right circles in the output trajectories (see Figure 7) with the symbols L and R, 
respectively. Forming symbolic sequences such as LRLLRRLR... from the output trajectories, we checked 
the periodicity of the symbolic sequences over a period of 100. Comparing Figure 16(a) with (b), it 
can be seen that the proportion of networks generating nonperiodic sequences is increased by adding 
the noise. On the other hand, in the region corresponding to a small amount of training data, the 
output is periodic even if noise is added. Thus, with added noise, the output of a network becomes 
nonperiodic if the network is trained with enough training data, whereas the output remains periodic 
if less training data are used. In addition, adding noise does not by itself guarantee that 
nonperiodic sequences are generated, because there is a region in which the output is periodic even if 
noise is added. This result therefore implies that in the neighborhood of a trained network subjected 
to a sufficient amount of data, there exists an appropriate chaotic dynamics. Hence, we can classify 
trained networks into three classes according to their dynamics: (1) those that have a structurally 
stable and periodic dynamics, (2) those that have a structurally stable and chaotic dynamics and 
(3) those that are structurally unstable but with a nearby system having a chaotic dynamics. Models 
with low noise can generate chaotic time series and imitate the nondeterministic characteristics 
underlying the training data not only in case (2) but also in case (3). 
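
The periodicity test on L/R symbol sequences can be sketched as follows. The exact criterion is not 
spelled out in the text, so this sketch assumes that a sequence counts as periodic when it repeats 
exactly with some period of at most 100. The function name `smallest_period` is hypothetical. 

```python
def smallest_period(symbols, max_period=100):
    """Smallest p <= max_period with symbols[i] == symbols[i + p] for all i, else None."""
    n = len(symbols)
    for p in range(1, min(max_period, n - 1) + 1):
        if all(symbols[i] == symbols[i + p] for i in range(n - p)):
            return p
    return None


periodic = "LRR" * 200
# Blocks L R, LL R, LLL R, ...: the spacing of R keeps growing, so no period fits.
nonperiodic = "".join("L" * k + "R" for k in range(1, 41))

print(smallest_period(periodic))
print(smallest_period(nonperiodic))
```

A network would be counted as generating nonperiodic output when the function returns `None` for its 
symbol sequence. When noise is added to the dynamics, an exact-match test of this kind requires the 
noise to be too small to alter the symbols of a genuinely periodic orbit. 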

5 Conclusion 

This paper has shown that a mixture of RNN experts model can acquire the ability to generate 
sequences combining multiple primitives. Through training of the model, each expert learns a 
primitive sequence pattern, and a gating network learns a switching rule that decides which expert 
to use as the winner on each occasion. Our simulation experiments showed that the model imitates 
stochastic switching between the multiple primitives by means of chaotic dynamics utilizing sensitive 
dependence on initial conditions. It was also demonstrated that a self-organized chaotic system can 
reconstruct the probability of primitive switching observed in the training data. We have also 
considered two essential problems concerning the memory capacity and the structural stability of the 
model. Through analysis of the network dynamics, we have inferred that a sufficient amount of training 
data, balanced with the memory capacity of the network, is a necessary condition for embedding 
stochastic time series into a chaotic dynamical system. We have also inferred that even if the 
dynamics of the model is not exactly chaotic because of structural instability, adding a negligible 
amount of noise to the model can enable stable reconstruction of stochastic time series. The model 
can be applied to prediction and generation of time series, especially when the time series to be 
learned can potentially be represented as a composition of several primitives and no prior knowledge 
of the primitives or of their composition rule is available. 

Figure 16: The ratio of networks generating nonperiodic output sequences. Here, Gaussian white 
noise is added to the input $x_n$ at each time step. (a) Phase plots for noise variance 0, namely, 
without noise. (b) Phase plots for noise variance 0.001. In these figures, NC denotes the number of 
context neurons in the gating network, the vertical axis is the length of each training sequence LT, 
and the horizontal axis is the number of training sequences NT. 